Abstract

The performance of neural network classifiers is determined by a number
of hyperparameters, including learning rate, batch size, and depth. A
number of attempts have been made to explore these parameters in the
literature, and at times, to develop methods for optimizing them. However,
exploration of parameter spaces has often been limited. In this note,
I report the results of large scale experiments exploring these different
parameters and their interactions.


1 Datasets and Libraries 

All experiments reported here were carried out using the Torch library [1] and
CUDA (some of the experiments have been reproduced on a smaller scale with
other libraries). The dataset for all the experiments is MNIST [3][2].

Characters were deskewed prior to all experiments. Deskewing significantly
reduces error rates in nearest neighbor classifiers. Skew corresponds to a
simple one-parameter family of linear transformations in feature space and
causes decision regions to become highly anisotropic. Without deskewing,
differences in performance between different architectures might primarily
reduce to their ability to “learn deskewing”. With deskewing, MNIST character
classification becomes more of an instance of a typical classification problem.
Prior results on classifying deskewed MNIST data both with neural networks and
with other methods are shown in the table below.
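
The deskewing procedure itself is not spelled out in this note. As a rough illustration only, the following sketch uses a common moment-based approach: estimate the horizontal drift per row from second-order image moments and undo it with a row-wise shear. The function names `estimate_skew` and `deskew`, and the choice of linear interpolation, are assumptions made for this example and are not taken from the experiments.

```python
import numpy as np

def estimate_skew(img):
    """Return (skew, cy): horizontal drift per row and the centroid row.

    Assumption: skew is estimated as mu11 / mu02 from central image moments.
    """
    img = np.asarray(img, dtype=float)
    total = img.sum()
    if total == 0:
        return 0.0, 0.0
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    cy = (ys * img).sum() / total
    cx = (xs * img).sum() / total
    mu11 = ((xs - cx) * (ys - cy) * img).sum()
    mu02 = ((ys - cy) ** 2 * img).sum()
    skew = mu11 / mu02 if mu02 > 0 else 0.0
    return skew, cy

def deskew(img):
    """Shear each row horizontally so that the estimated skew becomes zero."""
    img = np.asarray(img, dtype=float)
    skew, cy = estimate_skew(img)
    cols = np.arange(img.shape[1])
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        # Output pixel (y, x) samples the input row at x + skew * (y - cy).
        out[y] = np.interp(cols + skew * (y - cy), cols, img[y],
                           left=0.0, right=0.0)
    return out
```

Because skew is a one-parameter linear transformation, a single scalar estimate per image is enough to undo it in this sketch.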


Method                            Test Error (%)  Preprocessing                                      Reference
Reduced Set SVM deg 5 polynomial  1               deskewing                                          LeCun et al. 1998
SVM deg 4 polynomial              1.1             deskewing                                          LeCun et al. 1998
K-nearest-neighbors, L3           1.22            deskewing, noise removal, blurring, 2 pixel shift  Kenneth Wilder, U. Chicago
K-nearest-neighbors, L3           1.33            deskewing, noise removal, blurring, 1 pixel shift  Kenneth Wilder, U. Chicago
2-layer NN, 300 HU                1.6             deskewing                                          LeCun et al. 1998

Table 1: Other previously reported results on the MNIST database.


2 Logistic vs Softmax Outputs

Multi-Layer Perceptrons (MLPs) used for classification usually attempt to
approximate posterior probabilities and use those as their discriminant function.
Two common approaches to this are the use of least square regression with
logistic output units trained with a least square error measure (“logistic outputs”)
and a softmax output layer (“softmax outputs”). In the limit of infinite amounts
of training data, both approaches converge to true posterior probability
estimates. Softmax output layers have the property that they are guaranteed to
produce a normalized posterior probability distribution across all classes, while
least square regression with logistic output units generates independent
probability estimates for each class membership without any guarantees that these
probabilities sum up to one.
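
The difference in normalization can be made concrete with a small sketch. The pre-activation values in `z` are arbitrary numbers chosen for illustration, and the helper names are assumptions of this example rather than part of the Torch code used in the experiments.

```python
import numpy as np

def softmax(z):
    """Softmax outputs: exponentiate and normalize across classes."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logistic(z):
    """Logistic outputs: each class gets an independent estimate in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(outputs, target):
    """Least square error measure used with the logistic output units."""
    return np.sum((outputs - target) ** 2, axis=-1)

z = np.array([2.0, 0.5, -1.0])        # arbitrary output-layer pre-activations
p_soft = softmax(z)                   # sums to one by construction
p_log = logistic(z)                   # no constraint on the sum
loss = squared_error(p_log, np.array([1.0, 0.0, 0.0]))
```

Here `p_soft.sum()` is one up to rounding, while `p_log.sum()` is well above one for these inputs.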

Softmax is often preferred, although there is no obvious theoretical reason 
why it should yield better discriminant functions or lower classification error 
for finite training sets. In OCR and speech recognition, some practitioners have 
observed that logistic outputs yield better posterior probability estimates and 
better results when combined with probabilistic language models. In addition, 
when the sum of the posterior probability estimates derived from logistic outputs 
differs significantly from unity, that is a strong indication that the input lies 
outside the training set and should be rejected. 
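
A minimal sketch of such a rejection rule follows. The tolerance `tol` and the function name `classify_or_reject` are hypothetical; no particular threshold is given in this note, and a suitable value would have to be chosen on held-out data.

```python
import numpy as np

def classify_or_reject(probs, tol=0.5):
    """Return the arg-max class, or None if the logistic outputs do not
    sum to roughly one (tol is an illustrative, untuned threshold)."""
    total = float(np.sum(probs))
    if abs(total - 1.0) > tol:
        return None  # input looks unlike the training data: reject
    return int(np.argmax(probs))
```

For example, outputs such as `[0.02, 0.95, 0.01]` would be accepted as class 1, while `[0.05, 0.10, 0.02]` would be rejected under this threshold.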

Figure 1 shows a scatterplot of test vs training error for a large number
of MLPs with one hidden layer at different learning rates, different numbers of
hidden units, and different batch sizes. Such scatterplots show what error rates
are achievable by the different architectures, hyperparameter choices,
initializations, and order of sample presentations. The lowest points in the
vertical direction indicate the lowest test set error achievable by the
architecture in this set of experiments. The scatterplot shows that logistic
outputs achieve test set error rates of about 1.0% vs 1.1% for softmax outputs.
At the same time, logistic outputs never achieve zero percent training set
error, while softmax outputs frequently do.

In order to ascertain that the difference in test set error between the two
architectures is due to the architectures themselves, it is important to ensure
that the space of hyperparameters (learning rates, batch sizes, number of hidden
units) has been explored sufficiently. Note that, as Figure 2 shows, in order
to yield low error rates, softmax outputs require learning rates that are about
an order of magnitude lower than logistic outputs.

Figure 1: Training and test error for MLPs with logistic outputs (blue) and
softmax outputs (red). Note that softmax outputs achieve down to zero percent
training error (assigned to an error of 1e-4) but logistic outputs give overall
better performance on new training samples.

Figure 2: Learning rate and batch size effects depending on output layer type.
This is a scatterplot for all networks that yield test set error rates of less than
1.5%, with color indicating the test set error. Softmax outputs yield the best
results for a learning rate that is about an order of magnitude smaller than
logistic outputs.

Figure 3: The complete range of parameters explored in the MLP experiments
reported in this section. In this plot, color indicates error rate; circle size
indicates the number of hidden units; a circle with a border represents softmax
outputs.

Figure 3 demonstrates that the range of parameters (learning rates, batch
sizes) has been explored fully; above learning rates of 1e1, both softmax and
logistic output models diverge, and at small learning rates, both fail to learn in
a reasonable amount of time.

The fact that logistic outputs yield 10% lower relative error rates on such 
a simple and widely studied problem and architecture compared to softmax 
outputs does not prove that “logistic outputs are better than softmax outputs”, 
but it suggests that it is worth testing both logistic and softmax outputs on any 
particular problem to see which one yields lower test set error. 
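
For reference, the relative difference can be checked from the approximate best test errors quoted above (about 1.0% for logistic and 1.1% for softmax outputs). Treating the quoted 10% as a rounding of this ratio is an assumption of this worked example.

```latex
\[
\frac{1.1\% - 1.0\%}{1.1\%} \approx 0.09
\]
```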

3 Batch Size Effects

Optimizing the weights of a neural network can be carried out by stochastic
gradient descent (updating after each sample) or by full gradient descent
(computing a gradient on the parameters from the entire training set). In between
those two extremes is batched gradient descent. In batched gradient descent,
we update the parameters of the MLP to reduce the error for a small sample
(“batch”) of training samples, typically consisting of between 10 and 1000
samples.

Figure 4: Batch size and learning rate vs. error rate (logistic outputs). See the
text for an explanation.

Computationally, using batches instead of individual training samples allows 
for greater parallelism; each layer of an MLP computes effectively a function 
like: 


$$ y = \sigma(M \cdot x) $$

For simple SGD, $x$ is some $d$-dimensional vector, but for batched gradient
descent, $x$ is a $d \times b$ matrix, where $b$ is the batch size. The matrix
multiplication $M \cdot x$ can be evaluated much more efficiently and in parallel
than evaluating $b$ individual matrix-vector products in single-sample updates. In
fact, if we use $b$ processors, we can compute updates for $b$ samples in roughly
the same time as we would otherwise use for a single sample in SGD.
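
The equivalence between one batched product and b separate matrix-vector products can be sketched as follows. The sizes `d`, `h`, `b`, the random weights, and the use of a logistic nonlinearity for the layer function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, b = 784, 100, 32                   # input size, hidden units, batch size
M = rng.standard_normal((h, d)) * 0.01   # layer weights (random, for illustration)
X = rng.standard_normal((d, b))          # one column per training sample

def sigma(a):
    """Logistic nonlinearity applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

# Batched evaluation: a single matrix-matrix product for the whole batch.
Y_batched = sigma(M @ X)

# Single-sample evaluation: b separate matrix-vector products.
Y_loop = np.stack([sigma(M @ X[:, i]) for i in range(b)], axis=1)
```

Both paths produce the same `h × b` result; the batched form simply exposes the work as one large operation that parallel hardware can exploit.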

When using batch training, a common convention is to rescale the learning
rate $\lambda \to \lambda / b$. This means that as we increase the batch size
$b$, we need to scale up the learning rate proportionately. This is the
convention we use in these experiments.
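
As a sketch of this convention, assume the update sums the per-sample gradients of a batch and multiplies the sum by the nominal learning rate divided by the batch size. This reading, and the name `batch_update`, are assumptions of the example; the original training code is not shown in this note.

```python
import numpy as np

def batch_update(w, grads, lam):
    """One batched SGD step with a batch-normalized learning rate.

    grads has shape (b, n): one gradient per sample in the batch.
    The summed gradient is scaled by lam / b, which equals lam times
    the mean gradient.
    """
    b = grads.shape[0]
    return w - (lam / b) * grads.sum(axis=0)
```

Under this assumption, a batch of identical gradients produces the same step as a single-sample update with the same nominal rate, whatever the batch size.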

Figure 4 shows a scatterplot of trained networks with good test set error vs.
batch size and learning rate. There are three apparent limits on performance:

1. At the lower end, we have a soft transition from well performing
networks to poorly performing networks. The explanation of this is that at
low learning rates, the network learns too slowly to reach low test set
errors within the limited number of training steps. The reason this soft
transition slopes upwards is due to the use of normalized learning rates.
Without learning rate normalization, this line would remain horizontal
and independent of batch size.

2. At a batch size of 1, there is a maximum learning rate; beyond that
learning rate, the stochastic gradient descent optimization diverges. Without
batch size normalization of learning rate, this upper limit would exist
independent of batch size; due to batch size normalization, this line of
divergence slopes upwards, parallel to the soft lower bound on learning
rates.

3. There is a third, unexpected, constant limit on the batch-normalized
learning rate.

The original single sample learning rate determines the speed of convergence 
of the stochastic gradient descent algorithm; within its region of convergence, 
halving the learning rate approximately doubles the time needed to reach a 
given test set error, since each gradient update represents simply a step towards 
the minimum along some path. This reasoning also applies for batch updates. 
The constant upper limit on the batch-normalized learning rate corresponds to 
a single sample learning rate that decreases proportionally to batch size. 

The consequence is that, as long as the maximum usable batch normalized
learning rate increases proportionally with batch size, we benefit from
parallelization in terms of overall speedup of learning. Once we enter the regime
where the upper limit of batch normalized learning rates is independent of batch
size, further parallelization does not speed up training.

Without further experimentation, we can only guess at the source of the
upper limit on the batch-normalized learning rate. As a simplified model,
assume that the limit on the learning rate for single sample updates is due to
some subset of the input vectors (e.g., vectors that generate particularly large
gradients). For batch sizes that contain, on average, only one of those input
vectors, we can continue to use the original learning rate, but once we use a
batch size that contains, on average, two of those vectors, we have to cut the
learning rate in half in order to keep the magnitude of the update within the
range that allows convergence. (In practice, we are not necessarily looking at
individual samples but subspaces of the input.) This analysis suggests possible
strategies for improving training performance with large batch sizes that will be
explored elsewhere.
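
The simplified model can be written down directly. In this sketch, `eta0` (the largest stable single-sample learning rate) and `p` (the fraction of input vectors that generate large gradients) are hypothetical parameters introduced only for illustration; neither was measured in these experiments.

```python
def max_per_sample_rate(b, eta0, p):
    """Largest stable per-sample rate under the simplified model.

    A batch of size b contains p * b large-gradient vectors on average;
    once that expectation exceeds one, the rate is cut proportionally.
    """
    expected_hard = p * b
    return eta0 / max(1.0, expected_hard)

def max_batch_normalized_rate(b, eta0, p):
    """Corresponding limit on the batch-normalized learning rate."""
    return b * max_per_sample_rate(b, eta0, p)
```

With `eta0 = 0.1` and `p = 0.01`, the batch-normalized limit grows linearly up to a batch size of 100 and stays constant beyond it, which is the qualitative shape of the second and third limits described above.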

Regardless of the speed of optimization, we can also ask the question of how
the test set error of the final network depends on batch size. This is shown
in Figure 5. We see that increasing batch size generally results in worse test
set errors for both logistic outputs and softmax outputs. The dependence is
somewhat stronger for logistic outputs. In addition, logistic outputs appear to
yield networks with a higher variability. Note that the differences in error rates
in this plot are much smaller than the differences in error rates found in the
previous learning rate plots; this plot makes small but significant differences
among the very best models visible.

Figure 5: Test error by batch size for logistic and softmax outputs. Note
that logistic output units achieve the lowest error. Also note that for larger
batch sizes, the advantage of logistic output units over softmax output units
disappears.

The observations on the relationship between batch sizes and learning rates
above also have implications for hyperparameter optimization. In particular, the
hyperparameter search at large batch sizes becomes harder because the range of
learning rates that yield good networks is considerably smaller than it is at small
batch sizes (Figure 6). Batch sizes that are too large therefore not only waste
computational resources through parallelism that does not result in a speedup
for learning, they may actually make the hyperparameter search harder.
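
One way to quantify how narrow the usable range is at each batch size is to measure, in decades, the span of learning rates whose networks fall below an error threshold. The sketch below assumes results are available as `(batch_size, learning_rate, test_error)` tuples; the function name, the threshold, and the numbers in `results` are made up purely for illustration and are not data from these experiments.

```python
import math
from collections import defaultdict

def good_lr_range_decades(results, max_error):
    """Width (in log10 units) of the learning-rate range that yields
    test error <= max_error, computed separately for each batch size."""
    lrs = defaultdict(list)
    for batch_size, lr, test_error in results:
        if test_error <= max_error:
            lrs[batch_size].append(lr)
    return {b: math.log10(max(v) / min(v)) for b, v in lrs.items()}

# Made-up (batch_size, learning_rate, test_error %) tuples for illustration.
results = [
    (1, 0.01, 1.2), (1, 0.1, 1.1), (1, 1.0, 1.3),
    (100, 0.1, 2.5), (100, 1.0, 1.4), (100, 3.0, 1.45),
]
widths = good_lr_range_decades(results, max_error=1.5)
```

A smaller width at a given batch size means a random or grid search over learning rates is less likely to land on a good setting.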

4 Convolutional Layers 

The above results were all obtained for non-convolutional networks with a single 
hidden layer. How do they generalize to convolutional networks? There are, of 
course, many different kinds of convolutional architectures we could investigate. 
The simplest architecture places a single convolutional layer at the input of the 
network. 

Not surprisingly, adding a convolutional layer results in significantly lower
test set error (0.69% test set error) compared to non-convolutional networks
(1% test set error), as seen in Figure 7.

For convolutional networks, we also observe that small batch sizes yield the
best test set errors. In fact, large batch sizes never reach comparably low error
rates (Figure 8).
Figure 6: At large batch sizes, the range of hyperparameters resulting in good
test set errors is considerably smaller than at small batch sizes.

Figure 7: Training error vs test set error for non-convolutional (blue) and
convolutional (red) networks.

Figure 8: Test set error by batch size for convolutional networks. Output units
are softmax, and the plot shows the best networks across all learning rates and
number of convolutional units. Note the significant increase in best achievable
error rates with increasing batch size.

Figure 9: A comparison of sigmoid and ReLU units at the hidden layer,
combined with softmax and logistic outputs.

5 ReLU Units (Not Convolutional) 

Another popular architectural choice is to replace the sigmoidal nonlinearity in
hidden layers with rectified linear units (ReLU). To explore the effects of this on
the results, networks were trained with all four combinations of sigmoid/ReLU
hidden units and softmax/logistic output units. The scatterplot of results is
shown in Figure 9.
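
The four combinations can be expressed as a single forward pass with two switches. This is a sketch only: the function `mlp_forward`, its argument names, and the inclusion of bias terms are assumptions of the example and do not reproduce the Torch models used here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(a, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2, hidden="relu", output="logistic"):
    """One-hidden-layer MLP with selectable hidden and output nonlinearity."""
    pre = W1 @ x + b1
    hid = relu(pre) if hidden == "relu" else sigmoid(pre)
    z = W2 @ hid + b2
    return softmax(z) if output == "softmax" else sigmoid(z)
```

Looping over `hidden in ("sigmoid", "relu")` and `output in ("softmax", "logistic")` enumerates the four conditions compared in this section.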

We see in Figure 9 that softmax outputs achieve zero percent training set error
for both kinds of hidden layer nonlinearities. Furthermore, ReLU hidden units
outperform sigmoidal hidden units for either kind of output layer. The overall
best performing combination was logistic output units with ReLU hidden units,
resulting in a test set error of 0.92%.

6 ReLU Units (Convolutional) 

The previous results for ReLU units are interesting, given the low test set error 
rate and low dependence on batch size. It’s interesting to see whether we can 
reproduce those results for convolutional networks. However, in the case of 
convolutional networks, we find a significant batch size dependence, and that 
the difference between sigmoidal and ReLU hidden units is considerably smaller 
than for non-convolutional networks. Furthermore, some softmax networks also 
come close in performance (Figure 13). 
Figure 10: Dependence of test set error on batch size. We see that for sigmoidal
hidden units (blue, cyan), large batch sizes perform considerably worse than
small batch sizes. For ReLU hidden units (red, magenta), the dependence is
considerably weaker, although smaller batch sizes still seem to have a slight
advantage.

Figure 11: Dependence of test set error on the number of hidden units. For
sigmoidal hidden units (blue, cyan), there is little improvement of test set error
with increasing numbers of hidden units. For ReLU hidden units, there is a
strong improvement in test set error with the number of hidden units. The
maximum number of hidden units tested was 2000, although it looks like larger
numbers of hidden units might result in even better performance.

Figure 12: Error rates (indicated by color) vs batch size, learning rate, and
number of hidden units (indicated by circle size). These scatterplots suggest that
the parameter ranges for all four conditions were explored fairly completely.

Figure 13: Test error vs. batch size for convolutional networks with ReLU
units.

Figure 14: The batch size and learning rate parameter space explored for the
comparison of convolutional ReLU and sigmoidal networks.

Figure 15: Test set error by network depth for networks with sigmoidal hidden
units (blue) and ReLU hidden units (red).

7 Deep ReLU Networks 

ReLU networks also appear to be highly successful for training deep networks. 
We explore this architectural variant in these experiments by comparing deep 
networks with ReLU hidden units and sigmoidal hidden units. We find that the 
performance of deep networks with sigmoidal units degrades with depth, while
deep networks with ReLU units yield good performance even at large (eight
layer) depths (Figure 15). However, increasing depth for ReLU networks does
not result in better test set performance. The deterioration of test set error 
with increasing depth for sigmoidal hidden units is probably due to effects like 
vanishing gradients, something that ReLU networks do not seem to suffer from 
to the same degree. 
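
A back-of-the-envelope sketch illustrates why vanishing gradients are a plausible explanation. It looks only at the derivative of the nonlinearity and ignores the weight matrices, which is a strong simplification assumed for this example: the logistic derivative never exceeds 0.25, so the activation-derivative factor accumulated over eight layers is at most 0.25 to the eighth power, whereas an active ReLU unit passes gradients with a factor of one.

```python
import numpy as np

def sigmoid_grad(a):
    """Derivative of the logistic function; its maximum is 0.25 at a = 0."""
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

def relu_grad(a):
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return np.where(np.asarray(a) > 0, 1.0, 0.0)

depth = 8  # the largest depth mentioned in the text
sigmoid_bound = sigmoid_grad(0.0) ** depth   # best case per-path factor
relu_active = relu_grad(1.0) ** depth        # factor along an active path
```

Under these assumptions the sigmoidal factor is on the order of 1e-5 after eight layers, while the ReLU factor along an active path stays at one.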

For both types of deep networks, we can ask again how test set error depends 
on batch size. Figure 16 shows that for both types of deep networks and across 
a range of hidden layers, the best achievable test set error generally increases 
with increasing batch size. 

8 Conclusions and Discussion 

Machine learning algorithms cannot be totally ordered by performance and there 
is no single best learning algorithm. Nevertheless, machine learning benchmarks 
in general, and benchmarks on MNIST in particular, tell us something about 
how machine learning algorithms compare on typical classification tasks, and 
what kind of architectural features influence performance significantly. That 
is, the conclusions we can draw from such benchmarks are not so much about 
which algorithm is “better”, but rather which algorithmic choices may affect the 
outcome positively or negatively. 

Perhaps the most important result from these benchmarks is how complex
the interaction between different architectural features and conditions is;
performance improvements that can be demonstrated in simple networks do not
add up when combined together into the same architecture.

Figure 16: Test set error vs batch size for deep networks with ReLU (red) and
sigmoidal (blue) hidden layers. Circle size indicates number of hidden layers.
Notice that there is a significant decrease in test set performance with increasing
batch size.

Figure 17: Test set error (indicated by color with red being high, blue being
low) vs number of hidden layers and batch size. Note that deep networks are much
more sensitive to large batch sizes during training than shallow networks.

Figure 18: Parameter spaces explored by the deep ReLU and sigmoidal
networks. The parameter spaces show that the learning rate parameters were
explored sufficiently well for both network types.

Furthermore, many benchmarks that have been carried out in the literature
may have been hampered by limited sets of training conditions. For example,
benchmarking logistic vs. softmax outputs at larger batch sizes suggests that
there is little difference between the two methods; however, at small batch
sizes, logistic outputs significantly outperform softmax outputs on MNIST data
(Figure 5). Unless both methods are tested at small batch sizes, the significantly
better performance of logistic outputs is not revealed. As a second example, for
some architectures, ReLU networks show no batch size dependencies, while other
network architectures do show such dependencies.

It is important to remember that the MNIST dataset is not necessarily
representative of other classification problems: it has a small number of classes,
the prior probability is uniform, the number of training samples is small compared
to many other problems, all geometric variability (translations, rotations, skew)
has been removed, and the input vectors are binary. Therefore, more important
than the results about what works are the results about what unexpectedly doesn’t
work well even in such a simple case.

Based on the experiments reported here, we observe: 

• For many problems, increasing batch sizes in a parallel implementation
results in no speedup in training because the per-sample learning rate
needs to be scaled down proportionately to batch size. Furthermore, large
batch sizes may intrinsically limit the performance of networks. Finally,
hyperparameter optimization may get harder for larger batch sizes, as
the range of feasible learning rates (and other parameters) gets narrower.
Therefore, it is a good idea to carry out experiments with single sample
updates and small batch sizes.

• Softmax outputs may yield lower training errors than logistic outputs, but 
often also yield higher test set errors. Therefore, it is a good idea to try 
both kinds of outputs when training neural networks on different tasks. 
In doing so, it is important to try a wide range of learning rates, since the 
optimal learning rates for the two kinds of outputs are different. 

• For non-convolutional networks, ReLU hidden layer units perform
significantly better than sigmoidal hidden layer units in these experiments; they
also show lower batch size dependencies and scale to much larger numbers
of hidden units. For convolutional networks, however, these effects were
not observed, suggesting that both sigmoidal and ReLU non-linearities
should be tried.

• It is much easier to train deep networks using ReLU hidden layers than 
hidden layers with sigmoidal non-linearities. However, additional depth 
does not improve test set error for either sigmoidal or ReLU units. For 
deep networks, we also observed batch size dependencies. In addition, 
the range of good learning rates shifts and becomes smaller for deeper 
networks. 

Generally, these results suggest the following strategy for training new networks: 

• Start with single sample updates, both during initial exploration and
hyperparameter search.

• When exploring new problems, compare softmax and logistic outputs, as 
well as ReLU and sigmoidal hidden units. 

• Although deep ReLU networks can be trained, be sure to try shallow ReLU 
networks as well, and test a wide range of learning rates. 

We note that there has been an extensive literature on various improved
optimization methods for neural network learning, methods for learning
hyperparameters, and benchmarks of MLP performance. It is impossible to do this
literature justice in this technical report. However, a few general observations
should suffice:

• The methods described in this paper all relied on simple SGD training, 
yet yield excellent performance compared to other reported results. In 
particular, optimization or hyperparameter selection methods that yield 
significantly worse results than those reported here are of questionable 
utility. 

• Generally speaking, hyperparameter optimization for these kinds of
problems does not seem to be particularly critical; networks yield similar
performance over a broad range of hyperparameters.

• Hyperparameter optimization should not optimize for the best expected
test set error of the resulting networks, but for the minimal error over a
collection of multiple trained models.
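
The distinction in the last point can be illustrated with a small sketch. The configuration names and error values in `runs` are made up for illustration and are not results from this report; each list stands for the test errors of several networks trained with the same hyperparameters but different initializations.

```python
from statistics import mean

# Made-up test errors (%) of repeated training runs per configuration.
runs = {
    "config_a": [1.10, 1.12, 1.11],   # consistent, but never very low
    "config_b": [1.30, 0.95, 1.25],   # more variable, best single model
}

# Selecting by expected (mean) error vs. by the minimum over the collection.
best_by_mean = min(runs, key=lambda k: mean(runs[k]))
best_by_min = min(runs, key=lambda k: min(runs[k]))
```

In this toy example the two criteria disagree: the mean favors `config_a`, while the minimum over the collection favors `config_b`, which is the criterion recommended above when several models are trained anyway.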